Abstract

Labeled graphs are widely used to model complex data in many domains, so subgraph querying has been attracting more 
and more attention from researchers around the world. Unfortunately, subgraph querying is very time consuming since it 
involves subgraph isomorphism testing that is known to be an NP-complete problem. In this paper, we propose a novel 
coding method for subgraph querying that is based on Laplacian spectrum and the number of walks. Our method follows 
the filtering-and-verification framework and works well on graph databases with frequent updates. We also propose novel 
two-step filtering conditions that can filter out most false positives and prove that the two-step filtering conditions satisfy 
the no-false-negative requirement (no dismissal in answers). Extensive experiments on both real and synthetic graphs show 
that, compared with six existing counterpart methods, our method can effectively improve the efficiency of subgraph 
querying. 
Introduction

Labeled graphs, which include both vertex and edge labeling,
have been widely used to model complicated structures and
schemaless data in many domains such as social networks [1,2],
chemistry [3,4], image analysis [5,6], and XML documents [7,8].
This triggers the need for effective graph pattern discovery, and
the most compelling task is subgraph querying.

The subgraph query problem is to retrieve all the supergraphs
of a given graph from a graph database. It can be defined as
follows: for a large graph database $D = \{D_1, D_2, \ldots, D_n\}$ and a
query graph $Q$, subgraph query is to find all the graphs $D_i$
($i = 1, 2, \ldots, m \le n$) such that $Q$ is a subgraph of $D_i$. Fig. 1 shows an
example of subgraph query, where the graph database consists of
graphs $D_1$, $D_2$, $D_3$ and $D_4$, and $Q$ is the query graph. Obviously,
only graph $D_3$ contains $Q$.

However, it is intractable to find all supergraphs of a query
graph from a large graph database, since subgraph query must
conduct subgraph isomorphism testing, which is an NP-complete
problem [9,10]. In order to address this problem, the
filtering-and-verification framework is commonly adopted by most existing
methods. These methods first extract some "useful" graph features 
and build indexes for them; then, in the filtering phase, they 
traverse the indexes to prune most false positives and generate the 
candidate graph set; after that, in the verification phase, they 
validate the candidate graphs with subgraph isomorphism testing 
and obtain the answer set. 
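
The two phases described above can be sketched as follows. This is only a minimal illustration of the filtering-and-verification framework, not the implementation of any method cited here: the function names `subgraph_query`, `passes_filter` and `is_subgraph` are hypothetical, and the toy "graphs" are plain sets of edge names, so the subset test merely stands in for real subgraph isomorphism testing.

```python
def subgraph_query(database, query, passes_filter, is_subgraph):
    """Filtering-and-verification skeleton (names are illustrative assumptions).

    passes_filter must never reject a true answer (no false negatives);
    is_subgraph is the expensive exact test applied only to the candidates.
    """
    # Filtering phase: cheap test that prunes most false positives.
    candidates = [g for g in database if passes_filter(query, g)]
    # Verification phase: exact test on the surviving candidates.
    answers = [g for g in candidates if is_subgraph(query, g)]
    return candidates, answers


# Toy stand-ins: a "graph" is a frozenset of edge names.
db = [frozenset({"ab", "bc"}), frozenset({"ab"}), frozenset({"xy", "yz"})]
q = frozenset({"ab", "bc"})

candidates, answers = subgraph_query(
    db, q,
    passes_filter=lambda query, g: len(query) <= len(g),  # size-based filter
    is_subgraph=lambda query, g: query <= g,              # stand-in verifier
)
```

In this toy run the third graph survives the filter but is rejected by verification, which is exactly the kind of false positive a stronger filter aims to remove earlier.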

Among the existing subgraph query methods, some of them,
such as GraphGrep [11], gIndex [12], FG-Index [13], TreePi
[14], Tree+delta [15] and SwiftIndex [16], build inverted
indexes for features that are substructures extracted from graph
databases. The paths extracted by GraphGrep are too simple and
lead to low filtering efficiency [12]. The other methods have to
re-mine frequent substructures and re-build indexes from scratch for
databases with frequent updates, so they are quite time consuming
[17].

The Closure-tree method [18] uses clustering techniques to build
indexes. It clusters a set of graphs into several groups, and each
group is referred to as a graph closure. The graph closures are
then used as nodes to build an index tree. By traversing the index
tree, this method identifies a disqualified node via pseudo
subgraph isomorphism testing, and all graphs contained in this
node are pruned. As Closure-tree uses the expensive pseudo
subgraph isomorphism testing to filter out false positives, it costs
too much time in the filtering phase [19,20].

There are also subgraph query methods, for example GCoding [17]
and LsGCoding [21], which use graph coding methods to build
indexes. These methods extract high-quality features from graphs,
and map them into numerical space to generate graph codes. For
a specific feature, if its corresponding code in a query graph is
greater than that of a graph $D_i$, the query graph is not a subgraph
of graph $D_i$. So, $D_i$ can be filtered out as a false positive. According to
this logic, these methods build indexes based on codes to filter out
false positives. Moreover, these methods encode each graph
individually. When the graph database is updated with many insertions
and deletions, these methods do not need to re-compute graph
codes and re-build the indexes from scratch. However, the subtree
extracted by GCoding represents only partial structure, which degrades
its filtering efficiency; and the Laplacian matrix used in LsGCoding
only represents graphs with unlabeled edges, so LsGCoding can
only process graphs with unlabeled edges.
Figure 1. An Example of Subgraph Query. Four labeled graphs (a) Graph $D_1$, (b) Graph $D_2$, (c) Graph $D_3$, and (d) Graph $D_4$ compose the
database, and (e) Graph $Q$ is a query graph.
doi:10.1371/journal.pone.0097178.g001

In order to conduct subgraph query on labeled graphs, we
propose a novel Laplacian spectrum and the number of walks
based Graph Coding (LnGCoding) method by extending the
LsGCoding method. LnGCoding generates new codes whose
components are the vertex labels, the labels of adjacent edges
(which incorporate the labels of edges), the Laplacian
spectrum, and the number of walks. These codes carry new features
that are not contained in the codes of LsGCoding. Based on the new codes,
a novel index tree and novel two-step filtering conditions are
proposed in LnGCoding. Since the codes contain more information,
LnGCoding not only conducts subgraph querying on labeled
graphs, but also effectively filters out most false positives.
Moreover, it works well on databases with frequent updates.
Extensive experiments on both real and synthetic data show that
our proposed method LnGCoding can improve the efficiency of
subgraph query, especially on dense graphs with labeled edges.

Methods 

In this section, we present the novel coding method and its 
application in subgraph query. At first, we introduce the 
definitions of vertex and graph codes, the properties of graph 
features, and the coding method based on these graph features. 
Then, we state the index building method based on the novel 
graph codes, and provide the filtering conditions generation 
method. Finally, based on the indexes and the filtering conditions, 
we present the filtering-and-verification framework for subgraph 
query. Note that, a labeled graph is abbreviated to a graph in the 
rest of this paper. 

Definitions of Vertex and Graph Codes 

In our method, the vertex and graph codes are based on 
Laplacian spectrum and the number of walks. Therefore, we first 
give the definitions of adjacency matrix, Laplacian matrix and 
spectrum, walk and path. Then, based on these definitions, we 
define the vertex and graph codes. 

Definition 1 (Adjacency Matrix of Graph). Given a graph G with n
vertices, its adjacency matrix is defined as $M_G = (m_{ij})_{n \times n}$, where

$$m_{ij} = \begin{cases} 1, & \text{if vertex } v_i \text{ is adjacent to vertex } v_j, \\ 0, & \text{otherwise.} \end{cases}$$

Definition 2 (Laplacian Matrix and Laplacian Spectrum of Graph).
Given a graph G with n vertices, its Laplacian Matrix is defined as
$LM_G = (l_{ij})_{n \times n}$, where

$$l_{ij} = \begin{cases} Deg(v_i), & \text{if } i = j, \\ -1, & \text{if } i \ne j \text{ and vertex } v_i \text{ is adjacent to vertex } v_j, \\ 0, & \text{otherwise,} \end{cases}$$

and $Deg(v_i)$ is the degree of vertex $v_i$.

All eigenvalues of $LM_G$ are called graph G's Laplacian Spectrum.
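
As a small illustration of Definitions 1 and 2, the following sketch builds both matrices for an undirected simple graph given as an edge list. The function names and the three-vertex example graph are assumptions made for illustration only; computing the eigenvalues themselves would require a numerical routine that is not shown here.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix of an undirected simple graph."""
    m = [[0] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = 1
        m[j][i] = 1
    return m


def laplacian_matrix(adj):
    """Degrees on the diagonal, -1 for adjacent pairs, 0 elsewhere."""
    n = len(adj)
    return [[sum(adj[i]) if i == j else -adj[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical example: the path v0 - v1 - v2.
M = adjacency_matrix(3, [(0, 1), (1, 2)])
LM = laplacian_matrix(M)
```

Every row of a Laplacian matrix sums to zero, which is a quick sanity check on the construction.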

Definition 3 (Walk and Path). A walk in graph G consists of a pair
(V, E) of sequences, where V is a vertex sequence $v_0, v_1, \ldots, v_k$, and E is an
edge sequence $e_0, e_1, \ldots, e_{k-1}$. For $i = 0, 1, \ldots, k-1$, each successive pair
$v_i, v_{i+1}$ of vertices is adjacent in G, and edge $e_i$ has $v_i$ and $v_{i+1}$ as terminal
vertices.

A path is a walk with no repeated edges. 

For a path, no edge occurs more than once in the edge 
sequence. This is different from a walk. The length w of a walk (or 
path) is the number of edges which occur in the walk (or path). 

Definition 4 (Vertex Code). Given a graph G and a vertex $v \in G$, the
vertex code vCode of v is a quadruple:

$$vCode(v, G) = \langle L(v), A_e(v), Laps(v), N_w(v) \rangle,$$

where $L(v)$ is a length-$l_v$ ($l_v$ is an integer) counter string that denotes the vertex
label of v, $A_e(v)$ is a length-$l_e$ ($l_e$ is an integer) counter string that denotes the
labels of adjacent edges from v, $Laps(v)$ is the Laplacian spectrum of the
neighborhood graph of v, and $N_w(v)$ is a length-$l_v$ counter string that denotes
the number of walks of length W (W is an integer) from v. Note that the
counter string is an array of multi-digit counters, where each element counts the
occurrences of the specified vertices / edges / walks in a graph; and the adjacent
edge labels of v are two-tuples, consisting of the label of the edge and the label of
the terminal vertex that is on the same edge as v.

Fig. 2 shows $vCode(v_2, Q)$ of vertex $v_2 \in Q$, which occurred in
Fig. 1. For the sake of convenience, the two largest Laplacian
eigenvalues are used to denote the Laplacian spectrum of each
vertex, and the length W of walks is set to 2.

Definition 5 (Graph Code). Given a graph G with n vertices, and that the
vertex code of vertex $v_i$ is denoted as $vCode(v_i, G) = \langle L(v_i),
A_e(v_i), Laps(v_i), N_w(v_i) \rangle$, for $i = 0, 1, \ldots, n-1$. The graph code
gCode of G is defined as a quadruple:

$$gCode(G) = \langle L(G), A_e(G), Lapsseqs(G), N_w(G) \rangle,$$

where $L(G)$, $A_e(G)$, $Lapsseqs(G)$ and $N_w(G)$ are defined as follows:

1. $L(G)[j] = \sum_{i=0}^{n-1} L(v_i)[j]$, $j = 0, 1, 2, \ldots, l_v - 1$;

2. $A_e(G)[j] = \sum_{i=0}^{n-1} A_e(v_i)[j]$, $j = 0, 1, 2, \ldots, l_e - 1$;

3. $Lapsseqs(G)_j$ = the ranked Laplacian spectrum $Laps(v_i)[j]$ of all
vertices in non-ascending order, $j = 0, 1$, $i = 0, 1, 2, \ldots, n-1$;

4. $N_w(G)[j] = \sum_{i=0}^{n-1} N_w(v_i)[j]$, $j = 0, 1, 2, \ldots, l_v - 1$.

Figure 2. $vCode(v_2, Q)$. The vertex $v_2$ and graph $Q$ both occurred in Figure 1.
doi:10.1371/journal.pone.0097178.g002

Fig. 3 shows the graph code of graph Q, where $L(Q)$, $A_e(Q)$
and $N_w(Q)$ are generated by combining the $L(v_i)$, $A_e(v_i)$ and $N_w(v_i)$
codes of all vertices $v_i$ ($i = 0, 1, 2, 3, 4$) with the element-wise ADD
operation. Here, the element-wise ADD operation of counter
strings $C_1 = \langle C_1[0], C_1[1], \ldots, C_1[k] \rangle$ and $C_2 = \langle C_2[0],
C_2[1], \ldots, C_2[k] \rangle$ is defined as $\langle C_1[0] + C_2[0], C_1[1] + C_2[1],
\ldots, C_1[k] + C_2[k] \rangle$, and $C \in \{L, A_e, N_w\}$. For $Lapsseqs(Q)$, we
rank all the corresponding eigenvalues $Laps(v_i)[k]$ in
non-ascending order, and the results Lapsseq1 and Lapsseq2 are its
Laplacian spectrum sequences Lapsseqs.
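
The combination step of Definition 5 can be sketched as below. The function names and the two made-up vertex codes are illustrative assumptions; the values are not taken from Fig. 2 or Fig. 3.

```python
def elementwise_add(strings):
    """Element-wise ADD of equally long counter strings."""
    return [sum(column) for column in zip(*strings)]


def g_code(v_codes):
    """Combine vertex codes (L, Ae, Laps, Nw) into a graph code."""
    L = elementwise_add([c[0] for c in v_codes])
    Ae = elementwise_add([c[1] for c in v_codes])
    Nw = elementwise_add([c[3] for c in v_codes])
    num_eigs = len(v_codes[0][2])
    # One sequence per eigenvalue position, ranked in non-ascending order.
    lapsseqs = [sorted((c[2][j] for c in v_codes), reverse=True)
                for j in range(num_eigs)]
    return L, Ae, lapsseqs, Nw


# Two hypothetical vertex codes with l_v = l_e = 2 and two eigenvalues each.
v_codes = [
    ([1, 0], [0, 2], [3.0, 1.0], [2, 1]),
    ([0, 1], [1, 1], [4.0, 1.0], [0, 3]),
]
code = g_code(v_codes)
```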

The Properties of Graph Features 

In our coding method, the codes consist of the following 
features: i) the labels of vertices and adjacent edges, ii) Laplacian 
spectrum, and iii) the number of walks. Since these features have 
the following properties, we can use them to efficiently and 
effectively filter out false positives. 

The labels of vertices and adjacent edges. This is the first
graph feature in our proposed method. For each
vertex (or edge) of a graph, there exists a corresponding vertex (or
edge) in its supergraph. Based on this, we have the following
lemma.

Lemma 1 Let graph $G_1$ be a subgraph of graph $G_2$. For a specific label l,
the number of vertices (or edges) with label l in $G_1$ is not more than the number
of vertices (or edges) with label l in $G_2$.

Applying the converse-negative proposition of Lemma 1 to
vertices and graphs, we have the following corollaries.

Corollary 1 Given two graphs $G_1$ and $G_2$, and the two vertices $v \in G_1$
and $u \in G_2$ have the same vertex label. If there exists a specific adjacent edge
label l, and the number of adjacent edges with label l of vertex v is more than
the number of adjacent edges with label l of u, then u is not a corresponding
vertex of v.

Corollary 2 Given two graphs $G_1$ and $G_2$, if there exists a specific label
l, and the number of vertices (or adjacent edges) with label l in $G_1$ is more than
the number of vertices (or adjacent edges) with label l in $G_2$, then $G_1$ is not a
subgraph of $G_2$.
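
Corollary 2 amounts to a multiset comparison of labels. A minimal sketch, assuming each graph is represented simply by the list of its vertex (or adjacent edge) labels; the function name is hypothetical:

```python
from collections import Counter


def label_counts_rule_out(query_labels, data_labels):
    """True if some label occurs more often in the query than in the data graph,
    in which case the query cannot be a subgraph of the data graph."""
    q, d = Counter(query_labels), Counter(data_labels)
    return any(q[label] > d[label] for label in q)


ruled_out = label_counts_rule_out(["A", "A", "B"], ["A", "B", "C"])  # two A's vs one
kept = label_counts_rule_out(["A", "B"], ["A", "B", "B"])
```

A `False` result only means the data graph remains a candidate; it does not prove containment.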

Laplacian spectrum. We choose Laplacian spectrum as the 
second feature, since there exists a relationship between the 
Laplacian spectrum of a graph and the Laplacian spectra of its 
subgraphs, and this relationship can be used to efficiently filter out 
false positives. 

In order to prove that this relationship exists, we first
introduce the Min-Max Theorem [22] as follows.

Theorem 1 (Min-Max Theorem). Given a real symmetric matrix
$A_{n \times n}$ ($n$ is an integer) whose eigenvalues are $\lambda_{n-1} \le \lambda_{n-2} \le \cdots \le \lambda_0$,
the eigenvalues of matrix $A$ are represented as follows:

$$\max_{w_1, w_2, \ldots, w_{n-1-k} \in C^n} \left\{ \min_{x \ne 0,\, x \in C^n,\, x \perp w_1, w_2, \ldots, w_{n-1-k}} \frac{x^* A x}{x^* x} \right\} = \lambda_k,$$

and

$$\min_{w_1, w_2, \ldots, w_k \in C^n} \left\{ \max_{x \ne 0,\, x \in C^n,\, x \perp w_1, w_2, \ldots, w_k} \frac{x^* A x}{x^* x} \right\} = \lambda_k,$$

where the $w_i$ and $x$ are n-dimensional vectors, and $x^*$ is
the transposition of $x$.

Figure 3. $gCode(Q)$. All of the $vCode(v_i, Q)$ ($i = 0, 1, 2, 3, 4$) are combined into $gCode(Q)$.
doi:10.1371/journal.pone.0097178.g003

In Algebraic Graph Theory [23], according to the properties of
Laplacian matrices, the Laplacian matrix of a graph is a real
symmetric matrix, and each eigenvalue of a Laplacian matrix is
not less than zero. Thus, the Laplacian matrix of a graph is a real
symmetric positive semidefinite matrix. Applying the Min-Max
Theorem to positive semidefinite matrices, we
have the following corollary.

Corollary 3 Let $A_{n \times n}$ and $B_{n \times n}$ be two real symmetric matrices,
and their eigenvalues be $\lambda_{n-1} \le \lambda_{n-2} \le \cdots \le \lambda_0$ and $\beta_{n-1} \le
\beta_{n-2} \le \cdots \le \beta_0$, respectively. If matrix $(B - A)$ is a positive semidefinite
matrix, then for each $k \in \{0, 1, 2, \ldots, n-1\}$, $\lambda_k \le \beta_k$ holds.

Proof. For every nonzero vector $x \in C^n$, we have

$$\frac{x^* B x}{x^* x} = \frac{x^* (B - A + A) x}{x^* x} = \frac{x^* (B - A) x}{x^* x} + \frac{x^* A x}{x^* x}.$$

As matrix $(B - A)$ is a positive semidefinite matrix, the first term is not
less than zero, so

$$\frac{x^* B x}{x^* x} \ge \frac{x^* A x}{x^* x} \quad \text{for every } x \ne 0.$$

According to the Min-Max Theorem, $\lambda_k$ and $\beta_k$ are obtained by
applying the same minimization and maximization to these two quotients,
and both operations preserve the inequality. Thus, we have

$$\beta_k = \lambda_k(B) \ge \lambda_k(A) = \lambda_k.$$

According to Corollary 3, if two real symmetric matrices A and
B satisfy that matrix $(B - A)$ is a real symmetric positive
semidefinite matrix, the eigenvalues of B are not less than those
of A. Since the Laplacian matrix of each graph is a real symmetric
positive semidefinite matrix, we can apply Corollary 3 to a
graph and its subgraphs, and thus have the following theorem.

Theorem 2 For graph $G_1$ with m vertices and graph $G_2$ with n
($m \le n$) vertices, suppose 1) the matrix $A_{m \times m}$ is the Laplacian matrix of $G_1$,
and $B_{n \times n}$ is the Laplacian matrix of $G_2$; 2) the eigenvalues of matrix A are
$\lambda_{m-1} \le \lambda_{m-2} \le \cdots \le \lambda_0$ and the eigenvalues of matrix B are
$\beta_{n-1} \le \beta_{n-2} \le \cdots \le \beta_0$. If $G_1$ is a subgraph of $G_2$, then for each
$k = 0, 1, \ldots, m-1$, the Laplacian spectra of $G_1$ and $G_2$ satisfy

$$\lambda_k(G_1) \le \beta_k(G_2).$$

Proof. (sketch) Since $G_1$ is a subgraph of $G_2$, we can first generate
a new graph $G_3$ by adding $(n - m)$ vertices to graph $G_1$, where these
vertices occur in $G_2$ but not in $G_1$; and then obtain the $n \times n$
Laplacian matrix $A'$ of $G_3$ by padding the
$m \times m$ matrix A with $(n - m)$ zero rows and columns. This ensures that $G_3$ is also a subgraph of $G_2$, and
$A'$ has the same non-zero eigenvalues as A. Meanwhile, we
generate a new graph $G_4$ by removing the edges in $G_3$ from $G_2$,
and the Laplacian matrix of $G_4$ can be denoted as matrix $(B - A')$.
For a given graph, its Laplacian matrix is a real symmetric positive
semidefinite matrix. Thus, the Laplacian matrices $A'$, B and
$(B - A')$ are all real symmetric positive semidefinite matrices.
According to Corollary 3, for each $k \in \{0, 1, 2, \ldots, n-1\}$, we have
$\lambda_k(G_3) \le \beta_k(G_2)$. Since the m largest eigenvalues of $A'$ are the
eigenvalues of A, for each $k \in \{0, 1, 2, \ldots, m-1\}$,
$\lambda_k(G_1) \le \beta_k(G_2)$ holds.

Applying the converse-negative proposition of Theorem 2 to
Laplacian spectra of graphs, we have a useful corollary as follows.

Corollary 4 Given two graphs $G_1$ with m vertices and $G_2$ with n
vertices ($m \le n$), the Laplacian spectrum of $G_1$ is $\lambda_{m-1} \le \lambda_{m-2} \le \cdots \le \lambda_0$,
and the Laplacian spectrum of $G_2$ is $\beta_{n-1} \le \beta_{n-2} \le \cdots \le \beta_0$. If there exists an
integer k ($0 \le k \le m-1$) such that $\lambda_k(G_1) > \beta_k(G_2)$, then graph $G_1$ is not
a subgraph of graph $G_2$.
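
Corollary 4 can be illustrated with a short check on two precomputed spectra. This is a sketch under stated assumptions: the function name and the tolerance parameter `tol` (added to absorb floating-point error) are not part of the paper, and the spectra used in the example are the well-known closed-form Laplacian eigenvalues of small graphs rather than values from the figures.

```python
import math


def spectrum_rules_out(query_spec, data_spec, tol=1e-9):
    """True if the k-th largest query eigenvalue exceeds the k-th largest
    data eigenvalue for some k, so the query cannot be a subgraph."""
    q = sorted(query_spec, reverse=True)
    d = sorted(data_spec, reverse=True)
    if len(q) > len(d):
        return True  # the query has more vertices than the data graph
    return any(qk > dk + tol for qk, dk in zip(q, d))


# Path on 3 vertices (spectrum 3, 1, 0) inside a triangle (3, 3, 0).
path_in_triangle = spectrum_rules_out([3, 1, 0], [3, 3, 0])

# Star with 3 leaves (4, 1, 1, 0) against a path on 4 vertices.
path4 = [2 + math.sqrt(2), 2, 2 - math.sqrt(2), 0]
star_in_path = spectrum_rules_out([4, 1, 1, 0], path4)
```

The star is ruled out because its largest eigenvalue 4 exceeds the largest eigenvalue of the path, which agrees with the fact that a path has no vertex of degree 3.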

The number of walks. Paths of a graph are easier to extract
and manipulate than trees and subgraphs, so GraphGrep [11] uses
paths as index features. The indexes built on this kind of feature
are usually huge, especially when graph databases are large and
diverse, thus this method can be inefficient [12]. However, we find
that the number of walks of length $k \in N$ between two terminal
vertices can also preserve the basic information of a graph, and the
walks of a graph are much easier to extract and manipulate
than paths. Inspired by this, we extract metrics including the
number of walks with a specific length as the feature for graph
coding and further indexing.

Generally speaking, for each walk from vertex $v_i$ to vertex $v_j$ in a
graph, there must exist a corresponding walk from $v_i'$ (corresponding
to $v_i$) to $v_j'$ (corresponding to $v_j$) in its supergraph. Thus,
we have the following lemma.

Lemma 2 Given two graphs $G_1$ and $G_2$, and $G_1$ is a subgraph of $G_2$.
For a vertex $v_i \in G_1$, there exists a corresponding vertex $v_i' \in G_2$, and $v_i'$
satisfies that the number of walks of length W from $v_i$ to all vertices with label
l in graph $G_1$ is not more than the number of walks of length W from $v_i'$ to all
vertices with label l in graph $G_2$.

Applying the converse-negative proposition of Lemma 2 to
vertices and graphs, we have two useful corollaries as follows.

Corollary 5 Given two graphs $G_1$ and $G_2$, and the vertices $v \in G_1$ and
$u \in G_2$ have the same vertex label. If there exists a specific vertex label l, and the
vertex label l satisfies that the number of walks of length W from v to all
vertices with label l in graph $G_1$ is more than the number of walks of length W
from u to all vertices with label l in graph $G_2$, then u is not a corresponding
vertex of v.

Corollary 6 Given two graphs $G_1$ and $G_2$, if there exists a specific
vertex label l, and it satisfies that the number of walks of length W from all
vertices to all vertices with label l in graph $G_1$ is more than the number of
walks of length W from all vertices to all vertices with label l in graph $G_2$, then
$G_1$ is not a subgraph of $G_2$.

According to the above corollaries, we can use these features to 
filter out false positives. In order to speed up the comparisons 
between graph features, we map these features into the numerical 
space to generate vertex and graph codes. In the following 
subsection, we discuss how to generate vertex and graph codes. 

The Proposed Coding Method 

In this subsection, we present the novel coding method
consisting of three parts: i) $L$ and $A_e$ coding, ii) Laplacian
spectrum coding, and iii) $N_w$ coding.

$L$ and $A_e$ coding. For a vertex v, as stated in the former section,
$L(v)$ is a length-$l_v$ counter string to denote its vertex label, and $A_e(v)$
is a length-$l_e$ counter string to denote its adjacent edge labels. For
each distinct vertex (or adjacent edge) label, we use a hash function
to set K ($K \ge 1$) out of $l_v$ (or $l_e$) elements to 1. Then, $L(v)$ is directly
generated from the hash function of vertex label, and the code of
each adjacent edge is directly generated from the hash function of
adjacent edge label. By adding all adjacent edge codes with the
element-wise ADD operation, we can generate $A_e(v)$.

For a graph G, $L(G)$ and $A_e(G)$ are generated by adding $L(v)$ and
$A_e(v)$ of all vertices with the element-wise ADD operation.

In Fig. 4, we use vertex $v_2$ and graph $Q$ as examples to illustrate
the generation process of the $L$ and $A_e$ codes.

Figs. 4(a) and 4(b) are the hash functions of vertex label and
adjacent edge label, respectively. For convenience, we denote
each distinct vertex (or adjacent edge) label by setting K to 1.

For vertex $v_2$, $L(v_2)$ is the counter string of $v_2$ in the hash
function of vertex label. In order to generate $A_e(v_2)$, we first extract
all the adjacent edges of vertex $v_2$: <a, B>, <a, C> and <c,
G>. Then, we use the hash function of adjacent edge label to encode
each adjacent edge. Finally, we add these adjacent edge codes to
generate $A_e(v_2)$, as shown in Fig. 4(c).

For graph $Q$, we combine the $L(v_i)$ and $A_e(v_i)$ of all vertices $v_i$
($i = 0, 1, 2, 3, 4$) to generate its $L(Q)$ and $A_e(Q)$ codes by performing
the element-wise ADD operation, as shown in Fig. 4(d).
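
The label coding described above can be sketched as follows. The paper does not specify the hash function, so the SHA-256 based `label_code` below is purely an assumption made for illustration, as are the function names and the code length of 8; the resulting bit positions will therefore differ from those in Fig. 4.

```python
import hashlib


def label_code(label, length, k=1):
    """Counter string with k of `length` positions set to 1 for `label`.
    The hashing scheme is a hypothetical stand-in for the paper's hash function."""
    if not 1 <= k <= length:
        raise ValueError("k must be between 1 and length")
    code = [0] * length
    seed = 0
    while sum(code) < k:
        digest = hashlib.sha256(f"{label}:{seed}".encode("utf-8")).digest()
        code[int.from_bytes(digest[:4], "big") % length] = 1
        seed += 1  # retry with a new seed if the position was already set
    return code


def ae_code(adjacent_edges, length, k=1):
    """Element-wise ADD of the codes of all <edge label, terminal vertex label> pairs."""
    codes = [label_code(f"{e}|{t}", length, k) for e, t in adjacent_edges]
    return [sum(col) for col in zip(*codes)] if codes else [0] * length


L_v2 = label_code("A", 8)
Ae_v2 = ae_code([("a", "B"), ("a", "C"), ("c", "G")], 8)
```

Because the counters are added rather than OR-ed, two labels that collide on the same position still contribute separately, which keeps the counts monotone under the subgraph relation.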

Laplacian spectrum coding. Suppose graph G has n
vertices. For each vertex v, we first generate its Level-N Spanning
Graph, and then choose some Laplacian eigenvalues of the Level-N
Spanning Graph to generate its Laplacian spectrum $Laps(v)$. The
Level-N Spanning Graph of a vertex is defined as follows.

Definition 6 (Level-N Spanning Graph). Given a graph G and a
vertex $v \in G$, the Level-N Spanning Graph of v, denoted as LNSG(G, N, v),
is a subgraph representing the local structure around v, where v is the center
vertex, and the vertices and edges in LNSG(G, N, v) must satisfy the following:

1. for each vertex $v' \in G$, if the length of a walk between v and $v'$ is not more
than N, vertex $v'$ is in LNSG(G, N, v);

2. for each edge $e \in G$, if the two terminal vertices of e are both in LNSG(G,
N, v), edge e is in LNSG(G, N, v).

According to the above definition, the Level-N Spanning Graph of a
vertex is unique. By ranking the $Laps(v)$ of all the vertices in graph G
in non-ascending order, we obtain $Lapsseqs(G)$.

In order to better understand the Level-N Spanning Graph,
Table 1: Algorithm 1 lists the generation process of LNSG(G, N, v).

In Table 1: Algorithm 1, Lines 1-2 initialize the vertex set and
edge set of LNSG(G, N, v), respectively. Line 3 adds vertex v to the
vertex set of LNSG(G, N, v). Line 4 uses the function SEEK(v, G, N)
to find the other vertices in LNSG(G, N, v). The function SEEK(v,
G, N) uses depth-first search to traverse graph G and finds all
vertices in LNSG(G, N, v). Lines 5-9 look for all edges of LNSG(G, N,
v). For an edge, if its two terminal vertices are both in the vertex set
of LNSG(G, N, v), we add this edge to the edge set of LNSG(G, N, v).
Fig. 5 depicts examples of the Level-N Spanning Graph in
graphs $Q$ and $D_3$, which are both shown in Fig. 1. Fig. 5(a) shows
some Level-N Spanning Graphs for vertices $v_0$, $v_1$, $v_2$ and $v_4$ in
graph $D_3$. Fig. 5(b) shows some Level-N Spanning Graphs for
vertices $v_0$, $v_1$, $v_2$ and $v_3$ in graph $Q$. Obviously, the Level-N Spanning
Figure 4. $L$ and $A_e$ coding for graph $Q$. In this figure, (a) is the hash function of vertex label, (b) is the hash function of adjacent edge label, (c) is
the generating process of the $L$ and $A_e$ codes for vertex $v_2$, and (d) is the generating process of the $L$ and $A_e$ codes for graph $Q$.
doi:10.1371/journal.pone.0097178.g004
Table 1. Algorithm 1 Level-N Spanning Graph Generation.

Input: G is a graph, v is a vertex in G, N is the level number of LNSG;
Output: LNSG(G, N, v);

1: LNSG.V := ∅; // the vertex set
2: LNSG.E := ∅; // the edge set
3: LNSG.V := LNSG.V ∪ {v};
4: SEEK(v, G, N);
5: for each edge e ∈ G do
6:     if two terminal vertices of e are in LNSG.V then
7:         Insert the edge e into LNSG.E;
8:     end if
9: end for
10: return LNSG(G, N, v);

Function: SEEK(v, G, N)
1: if N == 0 then
2:     return;
3: end if
4: for each neighbor vertex u of vertex v do
5:     if u does not exist in LNSG.V then
6:         Insert the vertex u into LNSG.V;
7:         SEEK(u, G, N-1);
8:     end if
9: end for

doi:10.1371/journal.pone.0097178.t001

Graphs of $v_0$, $v_1$, $v_2$ and $v_3$ in $Q$ are subgraphs of those of $v_4$, $v_2$,
$v_1$ and $v_0$ in $D_3$, respectively.

From Fig. 5, we also find that there exists a relationship of
LNSG between two vertices, which is described by Lemma 3 as
follows.

Lemma 3 Let $G_1$ and $G_2$ be two graphs, and $v \in G_1$ and $v' \in G_2$ be two
vertices which have the same vertex label. If $G_1$ is a subgraph of $G_2$ and $v'$ is
the corresponding vertex of v in $G_2$, then the Level-N Spanning Graph of vertex
v is a subgraph of the Level-N Spanning Graph of vertex $v'$.

Proof. According to the subgraph isomorphism relationship, for
each vertex u ($u \ne v$) in LNSG($G_1$, N, v), there exists a
corresponding vertex $u'$ ($u' \ne v'$) in graph $G_2$. For each edge e in
LNSG($G_1$, N, v), there exists a corresponding edge $e'$ in graph $G_2$.
According to the definition of the Level-N Spanning Graph, there
exists a walk of length w ($1 \le w \le N$) between vertices u and v in
LNSG($G_1$, N, v). For graph $G_2$, there also exists a corresponding
walk of length w between vertices $u'$ and $v'$. Thus, vertex $u'$ is in
LNSG($G_2$, N, $v'$). That is, the corresponding vertex of each vertex
in LNSG($G_1$, N, v) is in LNSG($G_2$, N, $v'$). Similarly, all the
corresponding edges of LNSG($G_1$, N, v) are also in LNSG($G_2$, N,
$v'$). Thus, LNSG($G_1$, N, v) is a subgraph of LNSG($G_2$, N, $v'$).

In the proposed method, we extract some Laplacian eigenvalues
of LNSG(G, N, v) to generate $Laps(v)$, and generate $Lapsseqs(G)$ by
ranking the $Laps(v)$ of all the vertices.

In Fig. 6, we use graph $Q$ as an example to illustrate the generating
process of $Laps(v)$ and $Lapsseqs(G)$. We first compute the Laplacian
spectrum of each vertex v in graph $Q$, and extract the two largest
Laplacian eigenvalues Eigenvalue1 and Eigenvalue2 to generate
$Laps(v)$. In non-ascending order, we rank the corresponding
eigenvalues Eigenvalue1 and Eigenvalue2 of all vertices to
generate $Lapsseqs(Q)$, which contains two Laplacian spectrum
sequences Lapsseq1 and Lapsseq2. For convenience, we choose
the two largest eigenvalues to denote $Laps(v)$, and the level N of
LNSG is set to 2.

$N_w$ coding. A length-$l_v$ counter string is used to code $N_w(v)$
(or $N_w(G)$), which is the number of walks of length W. It is
generated from the W-th power of graph G's adjacency matrix. In
Algebraic Graph Theory [23], there exists a lemma with respect to
the number of walks of length W as follows.

Lemma 4 Let $M_G$ be the adjacency matrix of graph G, then the number
of walks of length W from the i-th vertex of G to the j-th vertex is $(M_G^W)_{ij}$,
that is, the entry in row i and column j of the W-th power of $M_G$.

Given graph G and its adjacency matrix $M_G$, if the entry in row
i and column j of $M_G$ is 1, there exists a walk of length 1 between
the i-th vertex and the j-th vertex in G. Similarly, the entry in row i
and column j of the W-th power of the adjacency matrix ($M_G^W$) is k if
and only if there exist k ($k > 0$) walks of length W between the i-th
vertex and the j-th vertex, where the vertices in a walk can be
repeated. Fig. 7 shows the $M_Q$ and $M_Q^2$ of graph $Q$, respectively.
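
Lemma 4 can be checked directly with a few lines of integer matrix arithmetic. The helper names and the triangle example are illustrative assumptions; the matrices of Fig. 7 are not reproduced here.

```python
def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, w):
    """W-th power of a square matrix by repeated multiplication (w >= 0)."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(w):
        result = mat_mul(result, m)
    return result


# Hypothetical example: a triangle on vertices 0, 1, 2.
M = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
M2 = mat_pow(M, 2)
```

In the triangle, each vertex has two walks of length 2 back to itself (one through each neighbour) and one walk of length 2 to each other vertex, which is what the entries of `M2` record.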

With the W-th power of the adjacency matrix $M_G^W$ of graph G, for
each vertex $v_i \in G$, we first extract all its walks of length W, and
generate the tuple <*, Label($v_j$)> by recording the label of the
terminal vertex $v_j$ in each walk. For each distinct tuple
<*, Label($v_j$)>, we use the hash function of walks to set K
out of $l_v$ elements to 1. Then, we map each tuple <*, Label($v_j$)>
into the numerical space by using the hash function of walks, and
the result is $N_w$(<*, Label($v_j$)>). Finally, we add the
$N_w$(<*, Label($v_j$)>) of all walks to generate $N_w(v_i)$ with the
element-wise ADD operation. Similarly, we add the $N_w(v_i)$ of all
vertices to generate $N_w(G)$. Note that the tuple <*, Label($v_j$)> is
used to represent all the walks of length W from vertex $v_i$ to vertex
$v_j$, regardless of whether the vertices or edges between them are the same or not;
and the symbol '*' just represents the other vertices and edges
that appear in a walk.

In Fig. 8, we use vertex $v_2$ and graph $Q$ as examples to illustrate
the generation process of $N_w(v_2)$ and $N_w(Q)$.

Fig. 8(a) is the hash function of walks, where we represent each
distinct walk by setting 1 ($K = 1$) out of $l_v$ elements to 1, and the
length W is set to 2. For vertex $v_2$, we first extract its four walks of
length 2: three walks <*, A> and one walk <*, D> according
to $M_Q^2$ in Fig. 7, and generate $N_w$(<*, A>) and $N_w$(<*, D>)
according to the hash function of walks. By adding $N_w$(<*, A>)
and $N_w$(<*, D>) with the element-wise ADD operation, we obtain
$N_w(v_2)$, as shown in Fig. 8(b). For graph $Q$, we add the $N_w(v_i)$ of all
the vertices ($i = 0, 1, 2, 3, 4$) to get $N_w(Q)$, as shown in Fig. 8(c).
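
The $N_w$ coding of a single vertex can be sketched as below. The example graph, its labels, and the explicit label-to-position table `slot_of` (which plays the role of the hash function of walks with K = 1) are all assumptions for illustration and do not reproduce Fig. 8.

```python
def walks_by_label(adj, labels, v, w):
    """Count walks of length w starting at vertex v, grouped by the label
    of the terminal vertex. adj is a 0/1 adjacency matrix."""
    n = len(adj)
    row = [int(j == v) for j in range(n)]  # walks of length 0
    for _ in range(w):
        # One more step: row <- row * adj (row v of the next matrix power).
        row = [sum(row[k] * adj[k][j] for k in range(n)) for j in range(n)]
    counts = {}
    for j, c in enumerate(row):
        if c:
            counts[labels[j]] = counts.get(labels[j], 0) + c
    return counts


def nw_code(counts, slot_of, length):
    """Counter string: add each tuple <*, label> into its assigned position."""
    code = [0] * length
    for label, c in counts.items():
        code[slot_of[label]] += c
    return code


# Hypothetical path v0 - v1 - v2 with labels A, B, A and W = 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
labels = ["A", "B", "A"]
counts_v0 = walks_by_label(adj, labels, 0, 2)
code_v0 = nw_code(counts_v0, {"A": 0, "B": 1}, 4)
```

Only row v of the matrix power is needed for a single vertex, so the sketch propagates one row instead of computing the full power.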

With the help of the above methods, we can extract these graph
features and generate the corresponding codes. By combining $L(v)$,
$A_e(v)$, $Laps(v)$ and $N_w(v)$ of the vertex v in a graph, we can generate
$vCode(v, Q)$, as shown in Fig. 2. By combining $L(v_i)$, $A_e(v_i)$, $Laps(v_i)$
and $N_w(v_i)$ of all the vertices $v_i$ in graph $Q$, we can generate the graph
code $gCode(Q)$, as shown in Fig. 3.

Index Building 

Based on the coding method, we build a graph index named 
LnGCode-Tree, which can improve the filtering efficiency. The 
construction method of the LnGCode-Tree is presented below. 

LnGCode-Tree is based on the GCode-Tree, which was first
proposed in GCoding [17]. Similar to S-Tree [24] and GCode-Tree,
LnGCode-Tree is also used to handle signature files, and
can be efficient for reducing the number of pairwise comparisons.
LnGCode-Tree is a balanced tree as well, and each index node in
LnGCode-Tree has at least m ($m \ge 2$) and at most M
($(M+1)/2 \ge m$) children. Different from GCode-Tree, we use
the labels of vertices and adjacent edges and the number of walks
Figure 5. Examples of LNSG in graphs $D_3$ and $Q$. In the example, (a) includes the graph $D_3$ and the LNSG of vertices $v_0$, $v_1$, $v_2$, $v_4$; (b) includes the
graph $Q$ and the LNSG of vertices $v_0$, $v_1$, $v_2$, $v_3$.
doi:10.1371/journal.pone.0097178.g005
to build LnGCode-Tree, while GCoding just uses the labels of
vertices and adjacent edges to build GCode-Tree.

Fig. 9 is an LnGCode-Tree built for the graphs in Fig. 1.
The building process can be illustrated as follows.

For each graph $D_i$, the $L(D_i)$, $A_e(D_i)$ and $N_w(D_i)$ codes of
$gCode(D_i)$ are used to build the index tree. For graphs with the same
$L$, $A_e$ and $N_w$ codes, a leaf node LNode is built. The code of LNode
consists of the $L$, $A_e$ and $N_w$ codes of graphs $D_i$ ($i = 1, 2, 3, 4$),
and LNode also contains the identities of these graphs. An
intermediate node INode has m children $CNode_i$, and its code is
generated as follows: for each element j in INode,
$INode.L[j] = Max(CNode_i.L[j])$, $INode.A_e[j] = Max(CNode_i.A_e[j])$, and
$INode.N_w[j] = Max(CNode_i.N_w[j])$, where $i = 1, \ldots, m$.
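
The code of an intermediate node can be sketched as an element-wise maximum over its children. The helper `may_contain` is an assumption rather than something stated in this section: it shows one way such a code could be used for pruning, based on Corollaries 2 and 6, since a query counter that exceeds the maximum of a subtree exceeds the counter of every graph below it. All names and values are hypothetical.

```python
def inode_code(children):
    """children: list of (L, Ae, Nw) codes. Returns the element-wise maxima."""
    l_parts, ae_parts, nw_parts = zip(*children)
    return tuple([max(col) for col in zip(*parts)]
                 for parts in (l_parts, ae_parts, nw_parts))


def may_contain(query_code, node_code):
    """Assumed pruning test: False means no graph under the node can contain the query."""
    return all(q <= n
               for q_part, n_part in zip(query_code, node_code)
               for q, n in zip(q_part, n_part))


children = [([1, 0], [2, 1], [0, 3]), ([0, 2], [1, 1], [4, 0])]
node = inode_code(children)
visit = may_contain(([1, 1], [0, 1], [2, 2]), node)
prune = may_contain(([2, 0], [0, 0], [0, 0]), node)
```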



After the index tree is built, our method generates novel two- 
step filtering conditions, and follows the filtering-and-verification 
framework to conduct query processing. 

Two-Step Filtering Conditions 

In this subsection, we present the two-step filtering conditions 
according to the properties of the graph features, and prove that 
these conditions satisfy the no-false-negative requirement. 

Filtering condition of vertices. Applying Corollary 1,
Lemma 3, Theorem 2 and Corollary 5 to vertices, we have a
theorem as follows.

Theorem 3 Let $G_1$ and $G_2$ be two graphs, $v \in G_1$ and $v' \in G_2$ be two
vertices, and $vCode(v, G_1) = \langle L(v), A_e(v), Laps(v), N_w(v) \rangle$ and
$vCode(v', G_2) = \langle L(v'), A_e(v'), Laps(v'), N_w(v') \rangle$ be the codes of
vertices v and $v'$ respectively. If $G_1$ is a subgraph of $G_2$ and $v'$ is the
corresponding vertex of v, then $vCode(v, G_1)$ and $vCode(v', G_2)$ satisfy the
following conditions:

1. $L(v')[i] = L(v)[i]$, $i = 0, 1, 2, \ldots, l_v - 1$;

2. $A_e(v')[i] \ge A_e(v)[i]$, $i = 0, 1, 2, \ldots, l_e - 1$;

3. $Laps(v')[i] \ge Laps(v)[i]$, $i = 0, 1$;

4. $N_w(v')[i] \ge N_w(v)[i]$, $i = 0, 1, 2, \ldots, l_v - 1$.

Figure 6. $Laps(v)$ and $Lapsseqs(G)$ coding for graph $Q$. The Lapsseqs of graph $Q$ is generated by ranking the Laps of all vertices $v_i$ ($i = 0, 1, 2, 3, 4$).
doi:10.1371/journal.pone.0097178.g006

Figure 7. $M_Q$ and $M_Q^2$. For the graph $Q$, (a) is the adjacency matrix of graph $Q$; (b) is the square of the adjacency matrix of $Q$.

Proof Since $G_1$ is a subgraph of $G_2$, and $v'$ is the corresponding vertex of $v$, the labels of $v'$ and $v$ are the same and their $L$ codes are identical as well. That is, their $L$ codes satisfy condition 1). According to Corollary 1, for each edge label $l$, the number of adjacent edges with label $l$ of $v'$ is not less than that of $v$, thus their $A_e$ codes satisfy condition 2). According to Lemma 3 and Theorem 2, $LNSG(G_1, N, v)$ is a subgraph of $LNSG(G_2, N, v')$, and the Laplacian spectra of $LNSG(G_1, N, v)$ and $LNSG(G_2, N, v')$ satisfy condition 3). According to Corollary 5, for each vertex label $l$, the number of walks of length $W$ from $v$ to all the vertices with label $l$ in $G_1$ is not more than the number of walks of length $W$ from $v'$ to all the vertices with label $l$ in $G_2$, thus their $N_w$ codes satisfy condition 4). Therefore, Theorem 3 is correct.

Theorem 3 shows the relationship between the codes of a vertex and its corresponding vertex. Taking the contrapositive of Theorem 3, we obtain the following first filtering condition.

Filtering condition 1 (Filtering Condition of Vertices). Let $G_1$ and $G_2$ be two graphs, and $vCode(v,G_1) = \langle L(v), A_e(v), Laps(v), N_w(v) \rangle$ be the code of vertex $v \in G_1$. If there does not exist a vertex $v' \in G_2$ whose code $vCode(v',G_2) = \langle L(v'), A_e(v'), Laps(v'), N_w(v') \rangle$ satisfies the following conditions:

1. $L(v')[i] = L(v)[i]$, $i = 0,1,2,\ldots,l_v-1$;

2. $A_e(v')[i] \ge A_e(v)[i]$, $i = 0,1,2,\ldots,l_e-1$;

3. $Laps(v')[i] \ge Laps(v)[i]$, $i = 0,1$;

4. $N_w(v')[i] \ge N_w(v)[i]$, $i = 0,1,2,\ldots,l_v-1$.

then $G_1$ is not a subgraph of $G_2$.

Lemma 5 Filtering Condition of Vertices satisfies the no-false-negative requirement for the subgraph query problem.

Proof (Proof by contradiction) Assume that Filtering Condition of Vertices does not satisfy the no-false-negative requirement. Let $G_2$ be a graph and $G_1$ be its subgraph. Filtering Condition of Vertices fails the no-false-negative requirement if and only if $G_2$ can be pruned by it. That is, for some specific vertex $v \in G_1$, there does not exist a vertex $v' \in G_2$ such that the vCodes of $v$ and $v'$ satisfy conditions 1), 2), 3) and 4) in Filtering Condition of Vertices. However, according to Theorem 3, for each vertex $v \in G_1$ there must exist a corresponding vertex $v' \in G_2$ such that the vCodes of $v'$ and $v$ satisfy conditions 1), 2), 3) and 4). Thus, graph $G_2$ cannot be pruned by Filtering Condition of Vertices. This contradicts the assumption. Therefore, Lemma 5 is correct.
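The vertex-level check can be sketched as follows. This is a minimal illustration, assuming vertex codes are already available as plain sequences; the `VCode` tuple, the function names, the tolerance `eps` and the sample values are hypothetical and are not part of the paper.

```python
from typing import List, NamedTuple, Optional, Sequence


class VCode(NamedTuple):
    L: Sequence[int]       # vertex label code
    Ae: Sequence[int]      # adjacent edge label counts
    Laps: Sequence[float]  # largest Laplacian eigenvalues of the LNSG
    Nw: Sequence[int]      # walk counts per hashed vertex label


def vertex_matches(q: VCode, d: VCode, eps: float = 1e-9) -> bool:
    """True if data vertex code d satisfies conditions 1) to 4) for query vertex code q."""
    return (
        list(d.L) == list(q.L)
        and all(x >= y for x, y in zip(d.Ae, q.Ae))
        and all(x >= y - eps for x, y in zip(d.Laps, q.Laps))  # eps absorbs float noise
        and all(x >= y for x, y in zip(d.Nw, q.Nw))
    )


def candidate_vertices(q_codes: List[VCode], d_codes: List[VCode]) -> Optional[List[List[int]]]:
    """Candidate data-vertex indices per query vertex, or None if the graph is pruned."""
    result = []
    for q in q_codes:
        cands = [j for j, d in enumerate(d_codes) if vertex_matches(q, d)]
        if not cands:
            return None  # some query vertex has no candidate: prune the graph
        result.append(cands)
    return result


# Made-up codes, for illustration only.
query = [VCode([1, 0], [1, 0], [2.0, 1.0], [1, 1])]
data = [VCode([1, 0], [2, 1], [3.0, 1.5], [2, 1]),
        VCode([0, 1], [3, 3], [4.0, 2.0], [5, 5])]
print(candidate_vertices(query, data))
```

Returning `None` as soon as one query vertex has no candidate mirrors the filtering condition: a single unmatched vertex is enough to rule out the data graph.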

Filtering conditions of graphs. Applying Corollary 2, 
Lemma 3, Theorem 2 and Corollary 6 to graphs, we have 
another theorem as follows. 

Figure 8. $N_w$ Coding. For the graph Q, (a) is the hash function of walks; (b) is the generating process of the $N_w$ code for vertex $v_2$; (c) is the generating process of the $N_w$ code for graph Q.

Figure 9. An example of LnGCode-Tree.

Theorem 4 Let $m$ and $n$ ($m \le n$) be the numbers of vertices of graphs $G_1$ and $G_2$, and $gCode(G_1) = \langle L(G_1), A_e(G_1), Lapsseqs(G_1), N_w(G_1) \rangle$ and $gCode(G_2) = \langle L(G_2), A_e(G_2), Lapsseqs(G_2), N_w(G_2) \rangle$ be the codes of graphs $G_1$ and $G_2$ respectively. If $G_1$ is a subgraph of $G_2$, then their graph codes $gCode(G_1)$ and $gCode(G_2)$ satisfy the following conditions:

1. $L(G_1)[i] \le L(G_2)[i]$, $i = 0,1,2,\ldots,l_v-1$;

2. $A_e(G_1)[i] \le A_e(G_2)[i]$, $i = 0,1,2,\ldots,l_e-1$;

3. $Lapsseqs(G_1)_k[i] \le Lapsseqs(G_2)_k[i]$, $i = 0,1$ and $k = 0,1,\ldots,m-1$;

4. $N_w(G_1)[i] \le N_w(G_2)[i]$, $i = 2,\ldots,l_v-1$.

Proof (sketch) Conditions 1), 2) and 4) can be directly derived from Corollary 2 and Corollary 6. Condition 3) is proved as follows. Since $Lapsseqs(G_1)[i]$ is a list sorted in non-ascending order, there exist $k+1$ vertices $v_j$ ($j = 0,1,2,\ldots,k$) in $G_1$ whose Laplacian eigenvalues satisfy $\lambda_j[i] \ge Lapsseqs(G_1)_k[i]$. According to Theorem 2, for each vertex $v_j$ ($j = 0,1,2,\ldots,k$) in $G_1$, there exists a corresponding vertex $v_j'$ in $G_2$ whose Laplacian eigenvalues satisfy $\lambda_j'[i] \ge \lambda_j[i]$ ($j = 0,1,2,\ldots,k$). Hence $G_2$ contains at least $k+1$ vertices whose $i$-th eigenvalue is not less than $Lapsseqs(G_1)_k[i]$. That is, $Lapsseqs(G_1)_k[i] \le Lapsseqs(G_2)_k[i]$ ($i = 0,1$). Thus condition 3) is correct.

Taking the contrapositive of Theorem 4, we obtain the second filtering condition.

Filtering condition 2 (Filtering Condition of Graphs). Let $m$ and $n$ ($m \le n$) be the numbers of vertices of graphs $G_1$ and $G_2$, and $gCode(G_1) = \langle L(G_1), A_e(G_1), Lapsseqs(G_1), N_w(G_1) \rangle$ and $gCode(G_2) = \langle L(G_2), A_e(G_2), Lapsseqs(G_2), N_w(G_2) \rangle$ be the codes of graphs $G_1$ and $G_2$ respectively. If $gCode(G_1)$ and $gCode(G_2)$ do not satisfy the following conditions:

1. $L(G_1)[i] \le L(G_2)[i]$, $i = 0,1,2,\ldots,l_v-1$;

2. $A_e(G_1)[i] \le A_e(G_2)[i]$, $i = 0,1,2,\ldots,l_e-1$;

3. $Lapsseqs(G_1)_k[i] \le Lapsseqs(G_2)_k[i]$, $i = 0,1$ and $k = 0,1,\ldots,m-1$;

4. $N_w(G_1)[i] \le N_w(G_2)[i]$, $i = 2,\ldots,l_v-1$.

then $G_1$ is not a subgraph of $G_2$.

Lemma 6 Filtering Condition of Graphs satisfies the no-false-negative requirement for the subgraph query problem.

Proof Similar to Lemma 5, this lemma can be proved by 
contradiction according to Theorem 4. 
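The graph-level conditions, in particular the rank-by-rank comparison of Lapsseqs in condition 3), can be sketched as below. This is an illustrative sketch under assumptions: graph codes are represented as dictionaries, the helper names `lapsseqs` and `passes_graph_filter` are hypothetical, every element of the $N_w$ code is compared (as in the tree traversal described later), and the numeric values are made up.

```python
from typing import Dict, List


def lapsseqs(vertex_laps: List[List[float]]) -> List[List[float]]:
    """For each eigenvalue index i, rank the Laps[i] of all vertices in non-ascending order."""
    return [sorted(column, reverse=True) for column in zip(*vertex_laps)]


def passes_graph_filter(q: Dict, d: Dict, eps: float = 1e-9) -> bool:
    """True if data graph code d is not ruled out for query graph code q."""
    def dominated(a, b):
        return all(x <= y for x, y in zip(a, b))

    if not (dominated(q["L"], d["L"]) and dominated(q["Ae"], d["Ae"])
            and dominated(q["Nw"], d["Nw"])):
        return False
    for q_seq, d_seq in zip(q["Lapsseqs"], d["Lapsseqs"]):
        if len(q_seq) > len(d_seq):      # query has more vertices than the data graph
            return False
        # zip stops after the m query ranks, i.e. k = 0, ..., m-1
        if any(x > y + eps for x, y in zip(q_seq, d_seq)):
            return False
    return True


# Made-up per-vertex Laps with two eigenvalues each.
q_code = {"L": [1, 1], "Ae": [1, 0], "Nw": [1, 1],
          "Lapsseqs": lapsseqs([[2.0, 1.0], [3.0, 1.0]])}
d_code = {"L": [2, 1], "Ae": [1, 1], "Nw": [3, 1],
          "Lapsseqs": lapsseqs([[4.0, 2.0], [2.5, 1.0], [3.5, 1.5]])}
print(q_code["Lapsseqs"], passes_graph_filter(q_code, d_code))
```

Sorting each eigenvalue column once at coding time keeps the query-time comparison linear in the number of query vertices.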

Filtering and Verification 

Based on the index and filtering conditions, we follow the 
filtering-and-verification framework to query subgraphs. 

First, we use the two-step filtering conditions to filter out false positives. In the first step, we traverse the LnGCode-Tree of the graph database with Filtering Condition of Graphs. Specifically, the graph code $gCode(Q)$ of the query graph $Q$ is compared with an intermediate node $INode_k$. If there exists an element $i$ that satisfies one of the conditions i) $gCode(Q).L[i] > INode_k.L[i]$; ii) $gCode(Q).A_e[i] > INode_k.A_e[i]$; or iii) $gCode(Q).N_w[i] > INode_k.N_w[i]$, then the children of $INode_k$ are pruned; otherwise, $gCode(Q)$ is compared with each child of $INode_k$. For a leaf node $LNode_k$, if there exists an element $i$ that satisfies one of the conditions i) $gCode(Q).L[i] > LNode_k.L[i]$; ii) $gCode(Q).A_e[i] > LNode_k.A_e[i]$; or iii) $gCode(Q).N_w[i] > LNode_k.N_w[i]$, then the graphs contained in $LNode_k$ are pruned as false positives; otherwise, the graphs contained in $LNode_k$ are added to the candidate graphs. After traversing the LnGCode-Tree, LnGCoding has filtered out some false positives, so the graph database is reduced. Then we compare the Lapsseqs of the query graph with those of the graphs in the reduced database, since the LnGCode-Tree only includes the $L$, $A_e$ and $N_w$ codes. Through this step, we obtain the primary candidate graph set for the query graph.
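The traversal described above can be sketched as a short recursive routine. This is an illustration only: nodes are modelled as plain dictionaries, the function names are hypothetical, and the tree and its codes are made up rather than copied from Fig. 9.

```python
from typing import Dict, List


def exceeds(q: Dict, node: Dict) -> bool:
    """True if some element of the query's L, Ae or Nw code is larger than the node's."""
    return any(qv > nv
               for key in ("L", "Ae", "Nw")
               for qv, nv in zip(q[key], node[key]))


def traverse(node: Dict, q: Dict, out: List[str]) -> List[str]:
    """Collect the graph ids of all leaves that survive the tree-level filter."""
    if exceeds(q, node):
        return out                      # prune this node and everything below it
    if "graphs" in node:                # leaf node
        out.extend(node["graphs"])
    else:                               # intermediate node
        for child in node["children"]:
            traverse(child, q, out)
    return out


# Made-up tree: intermediate codes are element-wise maxima of their children.
leaf12 = {"L": [1, 1], "Ae": [0, 1], "Nw": [2, 0], "graphs": ["D1", "D2"]}
leaf3 = {"L": [2, 1], "Ae": [1, 1], "Nw": [3, 1], "graphs": ["D3"]}
leaf4 = {"L": [2, 0], "Ae": [1, 0], "Nw": [1, 1], "graphs": ["D4"]}
inode_a = {"L": [1, 1], "Ae": [0, 1], "Nw": [2, 0], "children": [leaf12]}
inode_b = {"L": [2, 1], "Ae": [1, 1], "Nw": [3, 1], "children": [leaf3, leaf4]}
root = {"L": [2, 1], "Ae": [1, 1], "Nw": [3, 1], "children": [inode_a, inode_b]}

q = {"L": [1, 1], "Ae": [1, 1], "Nw": [1, 1]}
print(traverse(root, q, []))
```

The surviving graphs would then be compared on Lapsseqs, which the tree itself does not store.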

This step can be illustrated by the graphs in Fig. 1 and the corresponding LnGCode-Tree in Fig. 9. When traversing $INode_2$, we find that $gCode(Q).A_e[0] = 1 > INode_2.A_e[0] = 0$, thus graphs $D_1$ and $D_2$ are pruned. When traversing $LNode_4$, we find that $gCode(Q).A_e[2] = 1 > LNode_4.A_e[2] = 0$, so graph $D_4$ is pruned. Then, by comparing the Lapsseqs of query graph $Q$ and graph $D_3$, we find that $D_3$ is a candidate of $Q$.

In the second step, we use Filtering Condition of Vertices to filter out 
more false positives. Specifically, we compare each vertex code of 
the query graph with all the vertex codes of each graph in the 
primary candidate graph set until all the candidate vertices of this 
vertex have been found. By now, the candidate graph set and the 
candidate vertex set are generated. 

In Fig. 10, we use graph $Q$ as the query graph and $D_3$ as the primary candidate graph set to illustrate the second filtering step.

The vertex codes of all vertices in graphs $D_3$ and $Q$ are shown in Fig. 10(a). After filtering with Filtering Condition of Vertices, we generate the candidate vertex set of each vertex in query graph $Q$, as shown in Fig. 10(b). For each vertex in $Q$, there exist corresponding candidate vertices in $D_3$. Thus, $D_3$ is a candidate graph of $Q$.

After the filtering is finished, in the verification phase, we use 
the state-of-the-art subgraph isomorphism algorithm VF2 [25,26] 
to validate each candidate graph, and obtain the supergraph set 
for a query graph. 
Figure 10. A Filtering Example. For the labeled graphs D3 and Q.

Experimental Results and Discussion

In this section, after introducing the data source, the benchmark 
methods and parameter setting, and the evaluation criteria, we 
report the experimental results on efficiency comparison of the 
different methods, and test the scalability of our method. 

Data Source 

In this study, both real and synthetic graph databases are used. 

Real graph database. The AIDS antiviral screen database contains 43,905 classified chemical molecules, and is publicly available. Many researchers, such as Yan et al. [12], Shang et al. [16], Zou et al. [17], and He and Singh [18], used one of its subsets to test their methods, so we chose it as benchmark data as well.

The subset, used as the default database, consists of 10,000 graphs. On average, each graph has 25.4 vertices and 27.3 edges, which means that most graphs in this real graph database are sparse. Six query graph sets Q4, Q8, Q12, Q16, Q20 and Q24 are used to validate the efficiency of subgraph querying methods. Each query graph set Qi ($i = 4,8,12,16,20,24$) consists of 1,000 query graphs with $i$ edges.

Synthetic graph database. GraphGen [27] is a synthetic graph generator. In order to test the performance of existing methods on dense graphs, Han et al. [19,20] used it to generate the synthetic graph database Synthetic.10K.E30.D5.L50. The cardinality of the synthetic database is 10,000, the average size of graphs is 30, the density for each graph is 0.5, and the number of vertex/edge labels is 50.

Benchmark Methods and Parameter Setting 

Benchmark methods. The representative methods gIndex [12], FG-Index [13], Tree+delta [15], SwiftIndex [16], GCoding [17], and Closure-tree [18] are selected for comparison with our method. LsGCoding [21] aims at coding graphs with unlabeled edges, and optimizes the subgraph isomorphism algorithm according to the properties of such graphs. Since our experiments use graph databases with labeled edges, we do not compare LsGCoding with our method.

All these methods are implemented on the iGraph framework [19,20], which enables fair performance comparisons among the different methods.



Parameter setting. Our proposed method has three param- 
eters: the level of LNSG, the number of first largest Laplacian 
eigenvalues, and the length of walks. Fig. 11 shows the impact of 
these parameters on the real graph database. 

Fig. 11(a) shows the impact of the level of LNSG on the candidate set size. It indicates that when we choose more levels of LNSG, the candidate set size becomes smaller. However, the more levels of LNSG we choose, the more time is consumed in computing the Laplacian spectrum. Moreover, choosing 3 or more levels does not lead to a significant reduction in the candidate set size. Therefore, the level $N$ of LNSG is set to 2.

Fig. 11(b) shows the impact of the number of Laplacian eigenvalues on the candidate set size. We observe that choosing more Laplacian eigenvalues can reduce the size of the candidate graph set, but results in a larger graph code database and more code comparison time. At the same time, choosing 4 or more Laplacian eigenvalues does not lead to a significant reduction in the candidate set size. Therefore, we choose the first three largest eigenvalues in our method.

Fig. 11(c) shows the impact of the length of walks on the candidate set size. It shows that longer walks require more computation time for the powers of the adjacency matrix, and that choosing a length of 3 or greater does not lead to a significant reduction in the candidate set size. Thus we set the length $W$ to 2.
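To make the cost of longer walks concrete, the sketch below counts walks of length $W$ by raising the adjacency matrix to the $W$-th power and then groups the counts by the label of the end vertex. It is a minimal pure-Python illustration: the function names and the toy path graph are assumptions, and the hash function that maps labels to positions of the $N_w$ code is not reproduced here.

```python
from typing import Dict, List


def mat_mul(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_matrix(adj: List[List[int]], w: int) -> List[List[int]]:
    """Entry [u][v] of adj**w is the number of walks of length w from u to v."""
    n = len(adj)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    for _ in range(w):                 # one matrix product per unit of walk length
        result = mat_mul(result, adj)
    return result


def walks_by_label(adj: List[List[int]], labels: List[str], v: int, w: int) -> Dict[str, int]:
    """Number of length-w walks from vertex v to all vertices carrying each label."""
    row = walk_matrix(adj, w)[v]
    counts = {label: 0 for label in labels}
    for end_vertex, count in enumerate(row):
        counts[labels[end_vertex]] += count
    return counts


# Toy path graph 0 - 1 - 2 with vertex labels a, b, a.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
labels = ["a", "b", "a"]
print(walk_matrix(adj, 2))
print(walks_by_label(adj, labels, 0, 2))
```

Each additional unit of walk length costs one more matrix product, which is why a short length such as $W = 2$ keeps coding cheap.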

As recommended in [17] and [28], the lengths of the $L$, $A_e$ and $N_w$ codes are set to 30 (i.e. $l_v = l_e = 30$).

For methods gIndex, FG-Index, Tree+delta, SwiftIndex, GCoding and Closure-tree, the recommended parameter values are used. That is, for all substructure based index methods, the support threshold is set to 10%, and the maximum feature size maxL is set to 10. For gIndex and SwiftIndex, is set to 2. For FG-Index, $\delta$ is set to 0.1. For gIndex, the same size-increasing function as in [12] is followed. For GCoding, the level $N$ of LNPT is set to 2 and the number of eigenvalues to 2.

Evaluation Criteria 

A subgraph query algorithm usually consists of two processes: i) 
coding and indexing, and ii) subgraph querying. In this section, we 
briefly introduce some criteria metrics used to evaluate the 
efficiency of these two parts. 
Figure 11. Impacts of Parameters on Candidate Set Size. (a) Level of LNSG; (b) Laplacian eigenvalues; (c) Length of Walk.
doi:10.1371/journal.pone.0097178.g011



Criteria for coding and indexing. The coding and indexing 
time and the index size for both graph codes and the index tree are 
used in this process. 

a. Coding and Indexing Time. The coding and indexing time is the run time used to encode both graphs and their vertices and to build the index tree. A shorter coding and indexing time means higher performance in this process.

b. Index Size. The index size is the amount of space used to store both the graph codes and the index tree. In the filtering phase, more time is spent on accessing a larger index, so the index size partly impacts the filtering efficiency.

Criteria for subgraph querying. The candidate set size, the 
filtering time, the verification time and the response time are used 
in this process. 

a. Candidate Set Size. The candidate set size is the number of candidate graphs for each query graph. For each subgraph query algorithm, a smaller candidate set size implies higher filtering efficiency.

b. Filtering Time. For each subgraph query method, the filtering time is the run time to traverse the index to filter out false positives and generate the candidate set. A shorter filtering time implies higher filtering efficiency.

c. Verification Time. For each subgraph query method, the verification time is the run time to verify each candidate and generate the result set. A shorter verification time implies higher verification efficiency.

d. Response Time. For each subgraph query method, the response time is defined as the sum of the filtering time and the verification time. A shorter response time means higher querying efficiency.
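The four querying criteria can be collected with a small timing harness such as the one below. This is only a sketch: `run_query`, `timed`, and the toy filter and verifier (which operate on Python sets instead of graphs) are hypothetical stand-ins and do not correspond to the iGraph framework.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_query(query, database, filter_fn, verify_fn):
    """Filter, then verify, and report the criteria used for subgraph querying."""
    candidates, t_filter = timed(filter_fn, query, database)
    answers, t_verify = timed(
        lambda: [g for g in candidates if verify_fn(query, g)])
    return {
        "candidate_set_size": len(candidates),
        "answer_set_size": len(answers),
        "filtering_time": t_filter,
        "verification_time": t_verify,
        "response_time": t_filter + t_verify,  # sum of the two phases
    }


# Toy stand-in: "graphs" are sets, and containment plays the role of subgraph testing.
database = [{1, 2, 3}, {1, 2}, {4}, {2, 3, 5}]
query = {2, 3}
stats = run_query(
    query, database,
    filter_fn=lambda q, db: [g for g in db if len(g) >= len(q)],  # cheap necessary condition
    verify_fn=lambda q, g: q <= g,                                # exact check
)
print(stats["candidate_set_size"], stats["answer_set_size"])
```

Here the filter keeps three candidates, one of which is a false positive that only the verification step removes.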

Our experiments evaluate the efficiency of different subgraph 
query methods. For each subgraph query method, the run time is 
the most important criterion in each phase. Thus, in the first 
phase, the coding and index time is the primary criterion; and in 
the second phase, the response time is the primary criterion. 

Performance on Real Graph Database 

Performance of coding and indexing. Fig. 12 shows the 
performance of the seven methods on the real graphs in the coding 
and indexing process. 

Coding and Indexing Time. Fig. 12(a) shows the coding and indexing time of all seven methods on the real graph database. We observe that, as the database size increases from 2 K to 10 K, the coding and indexing time of each method increases.

Since LnGCoding must compute the expensive Laplacian spectrum, its coding and indexing time is greater than that of Closure-tree.

Among the coding based index methods, LnGCoding computes not only the Laplacian spectrum but also the number of walks. Thus, the coding and indexing time of LnGCoding is greater than that of GCoding.

The substructure based index methods extract graph features via expensive frequent subgraph or subtree mining. Thus, their coding and indexing time is greater than that of LnGCoding.

In short, the coding and indexing time of our method is much less than that of the substructure based index methods, and is comparable with those of GCoding and Closure-tree.

Index Size. Fig. 12(b) shows the index sizes of the seven methods on the real graph database. As the database size increases from 2 K to 10 K, the index size of each method also increases.

The index size of Closure-tree is greater than that of LnGCoding, since the coding based index methods map the information of graph features into numerical spaces, which saves storage space.

The index size of LnGCoding is greater than that of GCoding, since the code in LnGCoding consists of three parts: the labels of vertices and adjacent edges, the Laplacian spectrum, and the number of walks; while the code in GCoding contains two parts: the labels of vertices and adjacent edges, and the graph spectrum.

Since FG-Index generates all frequent subgraphs and all infrequent edges for completeness, its index size is greater than that of LnGCoding. For the other substructure based index methods, the index sizes are less than that of LnGCoding, because the sizes of mined features or the numbers of mined features are small [19].

Performance of querying. Fig. 13 shows the performance of 
the seven methods on the real graphs in querying process. 

Candidate Set Size. Fig. 13(a) shows that, as the query graph set varies from Q24 to Q4, the candidate set size of each method increases. This is because the answer set grows. When the query size is larger, such as Q24 and Q20, the candidate set sizes of the clustering based and coding based index methods are less than those of the substructure based index methods; when the query size is smaller, such as Q8 and Q4, the candidate set sizes of the clustering based and coding based index methods are greater than those of most substructure based index methods. The reason is that, for these substructure based index methods, more features are mined on the smaller sized graphs than on the larger sized graphs.

Figure 12. Performance of Coding and Indexing on Real Data. (a) Coding and Indexing Time; (b) Index Size.

Figure 13. Performance of Querying on Real Data. (a) Candidate Set Size; (b) Filtering Time; (c) Verification Time; (d) Response Time.

Closure-tree prunes more false positives than LnGCoding, since it conducts pseudo subgraph isomorphism testing, which is similar to the exact subgraph isomorphism algorithm.

Unlike GCoding, which uses the graph spectrum, LnGCoding uses the Laplacian spectrum and the number of walks as graph features, thus the candidate set size of LnGCoding is less than that of GCoding.

For the substructure based index methods, since fewer index features are mined for larger sized query graphs than for smaller sized query graphs, their candidate set sizes are greater than those of LnGCoding when the query graph sets are Q24 and Q20. When the size of the query graph is smaller, such as Q12, Q8 and Q4, the candidate set sizes of gIndex, Tree+delta and SwiftIndex are less than those of LnGCoding. FG-Index generates the largest candidate set, because it traverses the index to find a subset of mined features, each of which is a subgraph of the query graph. This means it does not find all subgraphs of a query graph from its index.

Filtering Time. Fig. 13(b) shows that, as the query graph set varies from Q24 to Q4, the filtering time of the clustering based and coding based index methods increases, while the filtering time of the substructure based index methods decreases. The reason is that, for the substructure based index methods, there are fewer index features in query graph set Q4 than in Q24, thus there are fewer comparisons between the query graph and the index features in Q4 than in Q24.

Fig. 13(b) also shows that the filtering time of Closure-tree is the largest, as it conducts pseudo subgraph isomorphism testing, which is quite time consuming.

The vertex and graph codes of LnGCoding are more complex than those of GCoding, and the code comparison of the former is more expensive than that of the latter. Thus, the filtering time of LnGCoding is slightly greater than that of GCoding.

For the substructure based index methods, since their index sizes are less than that of LnGCoding, they traverse the index to filter out false positives in less time. Thus, the filtering time of most of them is less than that of LnGCoding.

Verification Time. Fig. 13(c) shows that, as the query graph set varies from Q24 to Q4, the verification time of most methods increases.

Under the iGraph framework, Closure-tree employs a Java bytecode analyzer to verify candidates, while LnGCoding uses the state-of-the-art subgraph isomorphism algorithm VF2 [25] to verify candidates. Although Closure-tree has a smaller candidate set than LnGCoding, the verification time of Closure-tree is greater than that of LnGCoding.

For the graph coding based index methods, the candidate set size of LnGCoding is slightly less than that of GCoding, so the verification time of the former is also slightly less than that of the latter.

For the substructure based index method FG-Index, the verification time is less than that of LnGCoding for query graph set Q4, and is greater than that of LnGCoding for the other query graph sets. The reason is that FG-Index employs a verification-free strategy: when the query graph is an indexed feature, it directly reports the answer set without verification. Since Q4 contains the most indexed features among all query graph sets, the verification time of FG-Index is less than those of the other methods.

The verification time of gIndex is slightly less than that of LnGCoding for query graph sets Q24 and Q20. The reason is that the candidate set sizes of gIndex are slightly greater than those of LnGCoding, and the index size of gIndex is much less than that of LnGCoding, so its cost for finding the candidate graphs is less than that of LnGCoding. For the other query graph sets, the verification time of gIndex is less than that of LnGCoding, since the candidate set sizes of gIndex are much less than those of LnGCoding on these query graph sets.

The verification time of LnGCoding is less than that of Tree+delta for query graph sets Q24 and Q20, and is greater than that of Tree+delta for query graph sets Q16, Q12, Q8 and Q4. This is because the candidate set sizes of the former are much less than those of the latter for query graph sets Q24 and Q20, and greater than those of the latter for query graph sets Q16, Q12, Q8 and Q4.

Owing to the candidate set sizes, the verification time of LnGCoding is less than that of SwiftIndex for query graph sets Q24, Q20 and Q16, and is greater than that of SwiftIndex for query graph sets Q8 and Q4. For query graph set Q12, the verification time of LnGCoding is slightly greater than that of SwiftIndex, since the candidate set size of SwiftIndex is slightly greater than that of LnGCoding for query graph set Q12, and the index size of SwiftIndex is much less than that of LnGCoding.

Response Time. Fig. 13(d) shows that, as the query graph set varies from Q24 to Q4, the response times of most methods increase.

The filtering time and the verification time of Closure-tree are both the largest, so its response time is the largest.

Since the filtering time of LnGCoding is much less than that of GCoding, and its verification time is smaller than or comparable to that of the latter, the response time of LnGCoding is less than that of GCoding. This means that LnGCoding performs best on the real graph database among the clustering based and coding based index methods.

For the substructure based index method SwiftIndex, the filtering time is much less than that of LnGCoding on all query graph sets, so its response time is less than that of LnGCoding as well.

For the query graph set Q24, the filtering time of LnGCoding is much less than that of gIndex, thus its response time is less than that of the latter. For the other query graph sets, the filtering time and verification time of LnGCoding are both greater than those of gIndex, so its response time is greater than that of the latter.

For the query graph sets Q24 and Q20, the filtering time of LnGCoding is much less than that of Tree+delta, and the verification time of LnGCoding is much less than that of FG-Index, thus its response time is less than those of Tree+delta and FG-Index. For the other query graph sets, the filtering time of LnGCoding is greater than or much greater than those of Tree+delta and FG-Index, thus its response time is greater than those of Tree+delta and FG-Index.

According to the experimental results on real data, our method works well with larger query sizes. For small query sizes, our method is faster than GCoding and Closure-tree, but slower than the substructure based index methods.

In short, for the real data experiment, the response time of LnGCoding is not as good as that of substructure based methods like SwiftIndex, but LnGCoding outperforms these substructure based methods in coding and indexing.

Performance on Synthetic Graphs 

Performance of coding and indexing. Fig. 14 shows the 
performance of the seven methods on the synthetic graphs in the 
coding and indexing process. 

Coding and Indexing Time. Fig. 14(a) shows the coding and indexing time of the seven methods on the synthetic graph database. As the database size increases, the coding and indexing time of each method also increases.

PLOS ONE | www.plosone.org | May 2014 | Volume 9 | Issue 5 | e97178

Since LnGCoding must compute the expensive graph spectrum, the coding and indexing time of LnGCoding is greater than that of Closure-tree.

When computing the graph spectrum, GCoding generates the Level-N Path Tree (LNPT) and LnGCoding generates the LNSG. However, LNPT is built by adding duplicate vertices, whereas LNSG is generated without any duplicate vertices. Fig. 15 shows the differences between the LNSG and the LNPT of vertex $v_0$ in graph $D_2$ from Fig. 1.

From Fig. 15 we observe that $LNSG(D_2, 2, v_0)$ contains 4 vertices, but $LNPT(D_2, v_0, 2)$ contains 8 vertices. $LNPT(D_2, v_0, 2)$ contains four duplicated vertices (shown in red): one vertex $v_1$, one vertex $v_2$ and two vertices $v_3$. Since the computational complexity of the graph spectrum is $O(N^3)$ ($N$ is the number of vertices), GCoding is much more time consuming than LnGCoding, especially when the graph is dense. In the synthetic graph database, most graphs are dense. Thus, the coding and indexing time of LnGCoding is less than that of GCoding. Meanwhile, LNPT does not contain the cycles that occur in the graph, which degrades the filtering efficiency.
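The size gap between the two structures can be reproduced with a short sketch. It assumes that LNSG(G, N, v) is the subgraph on the vertices within N hops of v and that LNPT(G, v, N) has one tree node per simple path of length at most N starting at v; both definitions are stated here as assumptions, and the 4-vertex graph below is a hypothetical example chosen so that the counts match those quoted above, not a copy of $D_2$.

```python
from collections import deque
from typing import Dict, List, Set


def lnsg_vertices(adj: Dict[int, List[int]], root: int, level: int) -> Set[int]:
    """Vertices within `level` hops of root (each vertex appears once)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == level:
            continue                      # do not expand beyond the requested level
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def lnpt_size(adj: Dict[int, List[int]], root: int, level: int) -> int:
    """Number of tree nodes when every simple path of length <= level is a branch."""
    def expand(path: List[int]) -> int:
        if len(path) - 1 == level:
            return 1
        return 1 + sum(expand(path + [w]) for w in adj[path[-1]] if w not in path)
    return expand([root])


# Hypothetical dense graph on vertices 0..3: vertex 0 and vertex 3 touch everything.
adj = {0: [1, 2, 3], 1: [0, 3], 2: [0, 3], 3: [0, 1, 2]}
print(len(lnsg_vertices(adj, 0, 2)), lnpt_size(adj, 0, 2))
```

In this example the path tree repeats vertex 1 once, vertex 2 once and vertex 3 twice, so its spectrum is computed on a matrix twice as large as the one used for the neighborhood subgraph.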

For the substructure based index methods, the coding and indexing time of gIndex is the largest because it mines many more features, and the coding and indexing time of Tree+delta and SwiftIndex is smaller than that of LnGCoding because fewer features are mined.

In short, the coding and indexing time of our method is much less than that of gIndex and GCoding, and is comparable with that of the fastest method, Tree+delta.

Index Size. Fig. 14(b) shows the index size of the seven methods on the synthetic graph database. As the database size increases, the index size of each method also increases.

Since most of the synthetic graphs are dense, LnGCoding must use more space to store the Laplacian spectrum. Thus, the index size of LnGCoding is greater than that of Closure-tree.

For the coding based index methods, GCoding generates LNPT by adding some duplicate vertices while LnGCoding generates LNSG without any duplicate vertices, thus the index size of LnGCoding is smaller than that of GCoding.



For the substructure based index methods, gIndex mines many more features than the others, so its index size is greater as well. Moreover, the mined index features of these substructure based index methods are small subgraphs or substructures, thus the index size of LnGCoding is greater than those of these methods.

Performance of querying. Fig. 16 shows the performance of 
the seven methods on the synthetic graphs in querying process. 

Candidate Set Size. Fig. 16(a) shows the candidate set sizes of the seven methods on the synthetic graph database. We observe that, as the query graph set varies from Q24 to Q4, the candidate set size of each method increases, because the answer set size of each method increases.

Closure-tree conducts pseudo subgraph isomorphism testing in the filtering phase, thus its candidate set size is less than that of LnGCoding.

For the coding based index methods, GCoding and LnGCoding have roughly the same number of candidates.

For the substructure based index methods, the candidate set sizes of Tree+delta are less than those of LnGCoding on query graph sets Q24, Q20 and Q16, since it takes too much time to filter out false positives on these query graphs. For the other query graph sets, the candidate set sizes of LnGCoding are smaller than those of Tree+delta. For the other substructure based index methods, as their index features are not effective for dense graphs, their candidate set sizes are greater than those of LnGCoding.

Filtering Time. Fig. 16(b) shows the filtering time of the seven methods on the synthetic graph database.

Since Closure-tree conducts pseudo subgraph isomorphism testing to filter out false positives, its filtering time is much greater than that of LnGCoding.

For the coding based index methods, GCoding filters out more false positives than LnGCoding, thus its filtering time is greater than that of LnGCoding.

For the substructure based index methods, gIndex has the most mined features, and the sizes of most index features are small. For the query graph sets Q24, Q20 and Q16, gIndex uses ineffective features to minimize the number of candidates, thus its filtering time is greater than that of LnGCoding on these query graph sets. For the other query graph sets, its filtering time is less than that of LnGCoding.



Figure 14. Performance of Coding and Indexing on Synthetic Data. (a) Coding and Indexing Time; (b) Index Size.
doi:10.1371/journal.pone.0097178.g014







Figure 15. LNSG and LNPT of $v_0 \in D_2$. Panels: graph $D_2$; its LNSG; its LNPT.
doi:10.1371/journal.pone.0097178.g015
Figure 16. Performance of Query Processing on Synthetic Data. (a) Candidate Set Size; (b) Filtering Time; (c) Verification Time; (d) Response Time.
doi:10.1371/journal.pone.0097178.g016








Figure 17. Performance on Graphs with Varying Sizes. (a) Coding and Indexing Time; (b) Index Size; (c) Candidate Set Size; (d) Filtering Time; (e) Verification Time; (f) Response Time.
doi:10.1371/journal.pone.0097178.g017



The filtering time of Tree+delta is also greater than that of 
LnGCoding except for Q4. This is because the query graphs 
contain many cycles in the dense graph database, and Tree+delta 
mines too many graph features into its "delta", which is very time 
consuming. 



The mined features of FG-Index and SwiftIndex are not 
effective for the dense graph database, so they filter out far fewer false 
positives than LnGCoding. As a result, their filtering time is less than 
that of LnGCoding for all query graph sets. 

Figure 18. Performance on Graphs with Varying Vertex Labels. (a) Coding and Indexing Time; (b) Index Size; (c) Candidate Set Size; (d) Filtering Time; (e) Verification Time; (f) Response Time. 
doi:10.1371/journal.pone.0097178.g018 

Verification Time. Fig. 16(c) shows the verification time of the 
seven methods on the synthetic graph database. It shows that, 
as the query graph size decreases, the verification time of each 
method increases, because the candidate set size of each method 
increases. 

Because Closure-tree follows iGraph's original implementation 
exactly, using a Java bytecode analyzer, its verification time is 
greater than that of LnGCoding. 

For the coding-based index methods, the candidate set size of 
GCoding is slightly smaller than that of LnGCoding, so its verification 
time is slightly less than that of LnGCoding. 

For the substructure-based index method Tree+delta, the 
candidate set sizes are smaller than those of LnGCoding for query 
graph sets Q24, Q20 and Q16, so its verification time is less 
than that of LnGCoding on these query graph sets. For the 
other query graph sets, the candidate set sizes of Tree+delta 
are larger than those of LnGCoding, so its verification time is also 
greater than that of LnGCoding. 

For the other substructure-based index methods, the candidate 
set sizes are much larger than those of LnGCoding, so their 
verification time is also greater than that of LnGCoding. Note 
that the verification time of FG-Index is not the least for query 
graph set Q4, since there are not many frequent features on query 
graph set Q4. 

Response Time. Fig. 16(d) shows the response time of the seven 
methods on the synthetic graph database. 

Because Closure-tree has greater filtering time and verification 
time than LnGCoding, its response time is greater 
than that of LnGCoding. 

For the coding-based index methods, the filtering time of 
LnGCoding is much less than that of GCoding, so its response 
time is less than that of GCoding. 

The substructure-based index method Tree+delta takes much 
more time to filter out false positives, so its response time is 
greater than that of LnGCoding except for Q4. 

For the other substructure-based index methods, the filtering 
time is much less than that of LnGCoding for Q4, so their 
response time is less than that of LnGCoding on query graph set 
Q4. For the other query graph sets, the verification time of these 
methods is much greater than that of LnGCoding, so their response 
time is greater than that of LnGCoding. Thus, the response time 
of LnGCoding is the least among all methods except for query 
graph set Q4, and our method performs best on the dense graph 
database. 
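
The relationship among the three timings discussed above can be illustrated with a minimal Python sketch of a filtering-and-verification query loop. It assumes that response time is simply filtering time plus verification time, as the discussion suggests. The label-count filter and the brute-force verifier are hypothetical stand-ins chosen for brevity, not the LnGCoding filtering conditions or the verification algorithm used in the experiments, and all names (make_graph, passes_filter, is_subgraph, answer_query) are illustrative.

```python
import time
from collections import Counter
from itertools import permutations

# A labeled graph is (vertex_labels, edges): vertex_labels maps vertex -> label,
# edges is a set of frozenset({u, v}). Edge labels are omitted for brevity.
def make_graph(labels, edges):
    return labels, {frozenset(e) for e in edges}

def passes_filter(query, graph):
    """Cheap necessary condition (assumed stand-in filter): the query may not
    have more edges, or more vertices of any label, than the graph."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    q_count = Counter(q_labels.values())
    g_count = Counter(g_labels.values())
    return (len(q_edges) <= len(g_edges)
            and all(g_count[label] >= c for label, c in q_count.items()))

def is_subgraph(query, graph):
    """Brute-force subgraph isomorphism test (exponential; toy graphs only)."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    q_vertices = list(q_labels)
    for image in permutations(g_labels, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if any(q_labels[v] != g_labels[mapping[v]] for v in q_vertices):
            continue  # vertex labels must match
        if all(frozenset(mapping[v] for v in e) in g_edges for e in q_edges):
            return True  # every query edge is present in the graph
    return False

def answer_query(query, database):
    t0 = time.perf_counter()
    candidates = [n for n, g in database.items() if passes_filter(query, g)]
    t1 = time.perf_counter()
    answers = [n for n in candidates if is_subgraph(query, database[n])]
    t2 = time.perf_counter()
    return {"candidates": candidates, "answers": answers,
            "filtering_time": t1 - t0,
            "verification_time": t2 - t1,
            "response_time": t2 - t0}

# Toy database: D1 is a triangle, D2 lacks the B-C edge, D3 has no C vertex.
database = {
    "D1": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),
    "D2": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (1, 3)]),
    "D3": make_graph({1: "A", 2: "A", 3: "B"}, [(1, 3), (2, 3)]),
}
query = make_graph({"a": "A", "b": "B", "c": "C"}, [("a", "b"), ("b", "c")])
result = answer_query(query, database)
```

In this toy run D3 is pruned by the filter, D2 survives as a false positive that only verification rejects, and D1 is the single answer. A weaker filter leaves more candidates and therefore more verification work, which is the trade-off the comparison above describes.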

In summary, for the synthetic data with dense graphs, LnGCoding 
has the best response time and a coding and indexing time similar to 
that of the fastest methods; FG-Index and SwiftIndex are close 
competitors to LnGCoding regarding both evaluation measures. 

From the experiments on both real and synthetic graph data, 
we find that, although no method outperforms the 
others on all the databases, our proposed method does outperform 
its competitors when graphs are dense. 

Scalability Test 

In order to evaluate the scalability of LnGCoding, we conduct 
experiments on the synthetic graph data with different sizes and 
distinct vertex labels. 

The synthetic graph data consists of ten graph databases 
generated with the graph generator developed by 
Kuramochi and Karypis [29], which is also used in [18] and [17], by 
varying the cardinality and the number of vertex labels. Three subsets are 
selected as the query graph sets to test the scalability of our 
method. 

Performance on graphs with varying sizes. In this 
experiment, we generated five databases D5K, D10K, D20K, 
D30K and D40K by varying the database cardinality. Database 
DnK (n = 5, 10, 20, 30, 40) contains n×1000 graphs. 
The query graph sets are Q10, Q15 and Q20, where each query 
graph set Qi consists of 1,000 query graphs with i edges. 

Fig. 17 shows the performance of our method on graphs with 
varying sizes. As the database size increases, the coding and 
indexing time and the index size increase almost linearly. 
However, the growth rates of the candidate set size, the filtering 
time, the verification time, and the response time are much 
smaller, except for the query graph set Q10, whose candidate set 
size grows much faster than those of Q15 and Q20. This indicates 
that our method performs well on databases of different sizes. 

Performance on graphs with varying vertex labels. In 
this experiment, we also generated five databases D10L, D20L, 
D30L, D40L and D50L by varying the number of vertex labels. For database DnL 
(n = 10, 20, 30, 40, 50), the number of vertex labels is n. The query 
graph sets are Q10, Q15 and Q20, where each query graph set Qi 
consists of 1,000 query graphs with i edges. 

Fig. 18 shows the performance of our method on graphs with 
varying vertex labels. As the number of labels increases, 
1) the coding and indexing time and the index size decrease, 
except for the graphs with 10 labels, and 2) the candidate set 
size, the filtering time, the verification time, and the response 
time increase, but the growth rates are small or very small. This 
means our method works well on graphs with varying vertex labels. 

Conclusions 

In this paper, we propose a novel graph coding method, 
LnGCoding, which combines the Laplacian spectrum 
and the number of walks for subgraph querying over labeled 
graphs. 

Our method first extracts some new graph features, and then 
maps these features into the numerical space to generate the vertex 
and graph codes. A novel index is built to improve the filtering 
efficiency. We also present novel two-step filtering conditions 
that take the properties of the graph features into account, and 
prove their correctness. 
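
To make the idea of a numerical filtering condition concrete, the following Python sketch uses only the number of walks. If Q is a subgraph of D, every walk in Q is also a walk in D, so for each length k the total number of walks in Q cannot exceed that in D. This is a simplified, unlabeled illustration under that observation, not the LnGCoding vertex and graph codes or the two-step conditions themselves; count_walks and may_contain are hypothetical names.

```python
def count_walks(n, edges, length):
    """Total number of walks with `length` edges in an undirected graph on
    vertices 0..n-1 (equal to the sum of the entries of A**length)."""
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    walks_ending_at = [1] * n  # length 0: one trivial walk per vertex
    for _ in range(length):
        walks_ending_at = [sum(walks_ending_at[u] for u in neighbors[v])
                           for v in range(n)]
    return sum(walks_ending_at)

def may_contain(query, graph, max_length=3):
    """Necessary (not sufficient) condition for `query` to be a subgraph of
    `graph`: for each length up to max_length, the query has no more walks."""
    (qn, q_edges), (gn, g_edges) = query, graph
    return all(count_walks(qn, q_edges, k) <= count_walks(gn, g_edges, k)
               for k in range(max_length + 1))

# Toy graphs given as (number of vertices, edge list).
path3 = (3, [(0, 1), (1, 2)])
triangle = (3, [(0, 1), (1, 2), (0, 2)])
star4 = (4, [(0, 1), (0, 2), (0, 3)])

path_in_triangle = may_contain(path3, triangle)             # cannot be pruned
triangle_in_star = may_contain(triangle, star4)              # pruned at length 3
triangle_in_star_short = may_contain(triangle, star4, 2)     # false positive
```

The last two calls show why such a condition can never dismiss a true answer but may still leave false positives: with walks up to length 2 the triangle and the star are indistinguishable, and only the longer walks (or a final verification step) separate them.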

In order to evaluate the performance, extensive experiments on 
both real and synthetic data have been conducted. Experimental 
results show that, compared with the other six methods, our 
method works very well, especially when graphs are dense. 

In the future, we plan to use our graph coding method to explore 
similarity graph querying and supergraph querying. 